Noriyuki TONAMI Keisuke IMOTO Ryosuke YAMANISHI Yoichi YAMASHITA
Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.
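As a rough illustration of the joint objective such a shared network could be trained with, the sketch below sums a multi-label loss for SED and a single-label loss for ASC. The abstract does not give the loss; the function names and the trade-off weight `alpha` are assumptions made for this example.

```python
import math

EPS = 1e-12  # avoids log(0); purely a numerical safeguard

def binary_cross_entropy(probs, targets):
    """Mean BCE over event classes (SED is multi-label: several events may be active)."""
    return -sum(
        t * math.log(p + EPS) + (1 - t) * math.log(1 - p + EPS)
        for p, t in zip(probs, targets)
    ) / len(probs)

def cross_entropy(probs, target_index):
    """Categorical cross-entropy for the single scene label (ASC is single-label)."""
    return -math.log(probs[target_index] + EPS)

def joint_loss(event_probs, event_targets, scene_probs, scene_index, alpha=1.0):
    """Weighted sum of both task losses; alpha is a hypothetical trade-off weight."""
    sed = binary_cross_entropy(event_probs, event_targets)
    asc = cross_entropy(scene_probs, scene_index)
    return sed + alpha * asc

# Events: [mouse clicking, keyboard typing]; scenes: [office, street] (toy values).
good = joint_loss([0.9, 0.1], [1, 0], [0.8, 0.2], 0)
bad = joint_loss([0.1, 0.9], [1, 0], [0.2, 0.8], 0)
```

Because both terms are backpropagated through the shared layers, an error in either task updates the common representation, which is the mechanism by which the two tasks could assist each other.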
Sanghoon KANG Hanhoon PARK Jong-Il PARK
Image deformations caused by different steganographic methods are typically extremely small and highly similar, which makes their detection and identification a difficult task. Although recent steganalytic methods using deep learning have achieved high accuracy, they have been designed to detect stego images to which specific steganographic methods have been applied. In this letter, a steganalytic method is proposed that uses hierarchical residual neural networks (ResNets), allowing detection (i.e., classification between stego and cover images) and identification of four spatial steganographic methods (i.e., LSB, PVD, WOW, and S-UNIWARD). Experimental results show that using hierarchical ResNets achieves a classification rate of 79.71% in quinary classification, which is approximately 23% higher than that obtained using a plain convolutional neural network (CNN).
Zhenhui XU Tielong SHEN Daizhan CHENG
This paper studies the infinite time horizon optimal control problem for continuous-time nonlinear systems. A completely model-free approximate optimal control design method is proposed, which uses only real-time data measured along trajectories instead of a dynamical model of the system. This approach is based on the actor-critic structure, where the weights of the critic neural network and the actor neural network are updated sequentially by the method of weighted residuals. It should be noted that an external input is introduced to replace the input-to-state dynamics to improve the control policy. Moreover, a rigorous proof of convergence to the optimal solution, along with the stability of the closed-loop system, is given. Finally, a numerical example is given to show the efficiency of the method.
Lianqiang LI Kangbo SUN Jie ZHU
Knowledge distillation approaches can transfer information from a large network (teacher network) to a small network (student network) to compress and accelerate deep neural networks. This paper proposes a novel knowledge distillation approach called multi-knowledge distillation (MKD). MKD consists of two stages. In the first stage, it employs autoencoders to learn compact and precise representations of the feature maps (FM) from the teacher network and the student network. These representations can be treated as the essence of the FM, i.e., EFM. In the second stage, MKD utilizes multiple kinds of knowledge, i.e., the magnitude of each individual sample's EFM and the similarity relationships among several samples' EFM, to enhance the generalization ability of the student network. Compared with previous approaches that employ FM or handcrafted features from FM, the EFM learned from autoencoders can be transferred more efficiently and reliably. Furthermore, the rich information provided by the multiple kinds of knowledge helps the student network mimic the teacher network as closely as possible. Experimental results also show that MKD is superior to state-of-the-art approaches.
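The two kinds of knowledge used in the second stage can be pictured with the following minimal sketch. The abstract does not specify the loss functions, so the mean squared error for the magnitude term, the cosine similarity for the relational term, and the weight `beta` are all assumptions; the EFM vectors are taken as already produced by the autoencoders.

```python
import math

def magnitude_loss(teacher_efm, student_efm):
    """Mean squared difference between each sample's teacher and student EFM."""
    total, count = 0.0, 0
    for t_vec, s_vec in zip(teacher_efm, student_efm):
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count

def cosine(u, v):
    """Cosine similarity between two EFM vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def similarity_loss(teacher_efm, student_efm):
    """Mean squared difference between the two pairwise-similarity matrices."""
    n = len(teacher_efm)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = (cosine(teacher_efm[i], teacher_efm[j])
                    - cosine(student_efm[i], student_efm[j]))
            total += diff ** 2
    return total / (n * n)

def mkd_loss(teacher_efm, student_efm, beta=1.0):
    """Hypothetical combined objective; beta balances the two knowledge terms."""
    return (magnitude_loss(teacher_efm, student_efm)
            + beta * similarity_loss(teacher_efm, student_efm))

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # same geometry, wrong scale
```

In this toy case the student reproduces the relationships among samples but not their magnitudes, so only the magnitude term is non-zero, which shows why the two terms carry complementary information.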
Kazuya URAZOE Nobutaka KUROKI Yu KATO Shinya OHTANI Tetsuya HIROSE Masahiro NUMA
This paper presents an image super-resolution technique using a convolutional neural network (CNN) and multi-task learning for multiple image categories. The image categories include natural, manga, and text images, whose features differ from each other. However, several CNNs for super-resolution are trained with a single category. If the input image category is different from that of the training images, the performance of super-resolution is degraded. There are two possible solutions for handling multiple categories with conventional CNNs. The first involves preparing a CNN for each category. This solution, however, requires a category classifier to select an appropriate CNN. The second is to learn all categories with a single CNN. In this solution, the CNN cannot optimize its internal behavior for each category. Therefore, this paper presents a super-resolution CNN architecture for multiple image categories. The proposed CNN has two parallel outputs for a high-resolution image and a category label. The main CNN for the high-resolution image is a standard three-convolutional-layer architecture, and the sub neural network for the category label branches out from its middle layer and consists of two fully-connected layers. This architecture can simultaneously learn the high-resolution image and its category using multi-task learning. The category information is used for optimizing the super-resolution. In an applied setting, the proposed CNN can automatically estimate the input image category and change its internal behavior. Experimental results of 2× image magnification have shown that the average peak signal-to-noise ratio for the proposed method is approximately 0.22 dB higher than that for the conventional super-resolution with no difference in processing time or the number of parameters. We have confirmed that the proposed method is useful when the input image category varies.
Tsunato NAKAI Daisuke SUZUKI Fumio OMATSU Takeshi FUJINO
Artificial intelligence (AI), especially deep learning (DL), has made remarkable progress and has been applied to various industries. However, adversarial examples (AE), which add small perturbations to the input data of deep neural networks (DNNs) to cause misclassification, are attracting attention. In this paper, we propose a novel black-box attack that crafts AE using only processing time, which is side-channel information of DNNs, without using training data, model architecture and parameters, substitute models, or output probability. While several existing black-box attacks use output probability, our attack exploits a relationship between the number of activated nodes and the processing time of DNNs. In our attack, the perturbations for AE are determined by the difference in processing time across input data. We show experimental results in which the AE crafted by our attack increase the number of activated nodes and effectively cause misclassification to one of the incorrect labels. In addition, the experimental results highlight that our attack can evade gradient masking countermeasures, which mask output probability to prevent the crafting of AE by several black-box attacks.
Kota YOSHIDA Mitsuru SHIOZAKI Shunsuke OKURA Takaya KUBOTA Takeshi FUJINO
A model extraction attack is a security issue in deep neural networks (DNNs). Information on a trained DNN model is an attractive target for an adversary not only in terms of intellectual property but also of security. Thus, an adversary tries to reveal the sensitive information contained in the trained DNN model from machine-learning services. Previous studies on model extraction attacks assumed that the victim provides a machine-learning cloud service and the adversary accesses the service through formal queries. However, when a DNN model is implemented on an edge device, adversaries can physically access the device and try to reveal the sensitive information contained in the implemented DNN model. We call these physical model extraction attacks model reverse-engineering (MRE) attacks to distinguish them from attacks on cloud services. Power side-channel analyses are often used in MRE attacks to reveal the internal operation from power consumption or electromagnetic leakage. Previous studies, including ours, evaluated MRE attacks against several types of DNN processors with power side-channel analyses. In this paper, we evaluate information leakage from a systolic array, which is used as the matrix multiplication unit in DNN processors. We utilized correlation power analysis (CPA) for the MRE attack and revealed the weight parameters of a DNN model from the systolic array. Two types of systolic arrays were implemented on a field-programmable gate array (FPGA) to demonstrate that CPA reveals weight parameters from those systolic arrays. In addition, we applied an extended analysis approach called “chain CPA” for robust CPA against the systolic arrays. Our experimental results indicate that an adversary can reveal trained model parameters from a DNN accelerator even if the DNN model parameters on the off-chip bus are protected with data encryption. Countermeasures against side-channel leaks will be important for implementing a DNN accelerator on an FPGA or application-specific integrated circuit (ASIC).
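The principle behind CPA can be sketched on simulated data: for every candidate weight, the adversary predicts the leakage of the multiplication and correlates the prediction with the measured traces. The code below is a toy model only; the Hamming-weight leakage of the product, the noise-free traces, and the 8-bit weight value are assumptions for illustration and do not reproduce the systolic-array experiments or chain CPA.

```python
import math

def hamming_weight(value):
    """Number of set bits, a common first-order model of dynamic power."""
    return bin(value).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_traces(secret_weight, inputs):
    """Toy, noise-free leakage: one power sample per weight-input multiplication."""
    return [float(hamming_weight(secret_weight * x)) for x in inputs]

def cpa_scores(inputs, traces, candidates):
    """Correlate the predicted leakage of each weight guess with the traces."""
    scores = {}
    for guess in candidates:
        predicted = [hamming_weight(guess * x) for x in inputs]
        scores[guess] = pearson(predicted, traces)
    return scores

SECRET_WEIGHT = 181        # hypothetical 8-bit weight held in a processing element
inputs = list(range(256))  # activations assumed known to the adversary
traces = simulate_traces(SECRET_WEIGHT, inputs)
# Guess 0 is skipped: its predicted leakage is constant, so correlation is undefined.
scores = cpa_scores(inputs, traces, range(1, 256))
ranking = sorted(scores, key=scores.get, reverse=True)
```

With noise-free traces the correct weight reaches a correlation of essentially 1. Note that in this simple model a weight and its power-of-two multiples produce identical Hamming weights, so real analyses need a leakage model and measurement setup that separate such candidates, and many more traces to average out noise.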
Recurrent neural networks (RNNs) have been proven effective for sequence-based tasks thanks to their capability to process temporal information. In real-world systems, deep RNNs are more widely used to solve complicated tasks such as large-scale speech recognition and machine translation. However, the implementation of deep RNNs on traditional hardware platforms is inefficient due to long-range temporal dependence and irregular computation patterns within RNNs. This inefficiency manifests itself in the latency of RNN inference on CPUs and GPUs increasing in proportion to the number of layers of deep RNNs. Previous work has focused mostly on optimizing and accelerating individual RNN cells. To make deep RNN inference fast and efficient, we propose an accelerator based on a multi-FPGA platform called Flow-in-Cloud (FiC). In this work, we show that the parallelism provided by the multi-FPGA system can be exploited to scale up the inference of deep RNNs by partitioning a large model onto several FPGAs, so that the latency stays close to constant as the number of RNN layers increases. For single-layer and four-layer RNNs, our implementation achieves 31x and 61x speedup, respectively, compared with an Intel CPU.
Yasuhiro NAKAHARA Masato KIYAMA Motoki AMAGASAKI Masahiro IIDA
Quantization is an important technique for implementing convolutional neural networks on edge devices. Quantization often requires relearning, but relearning cannot always be applied because of issues such as cost or privacy. In such cases, it is important to know the numerical precision required to maintain accuracy. We accurately simulate calculations on hardware and measure the relationship between accuracy and numerical precision.
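A minimal sketch of how the effect of numerical precision can be simulated in software is given below. It rounds operands to signed fixed point at several bit widths and records the error of a dot product against a floating-point reference. The fixed-point format, the saturation behavior, the full-precision accumulator, and the sample values are assumptions for illustration, not the simulation used in the paper.

```python
def quantize(value, frac_bits, total_bits):
    """Round to signed fixed point with `frac_bits` fractional bits, saturating on overflow."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale

def quantized_dot(weights, activations, frac_bits, total_bits):
    """Dot product with quantized operands (accumulator kept at full precision)."""
    return sum(
        quantize(w, frac_bits, total_bits) * quantize(a, frac_bits, total_bits)
        for w, a in zip(weights, activations)
    )

# Hypothetical values inside (-0.5, 0.5) so that no operand saturates.
weights = [0.31, -0.47, 0.12, -0.05, 0.49]
activations = [0.25, 0.40, -0.33, 0.10, 0.45]
reference = sum(w * a for w, a in zip(weights, activations))

errors = {}
for frac_bits in (2, 4, 8, 12):
    approx = quantized_dot(weights, activations, frac_bits, total_bits=frac_bits + 1)
    errors[frac_bits] = abs(approx - reference)
```

Each operand is off by at most half a quantization step, so the dot-product error typically shrinks as fractional bits are added. Sweeping the bit width in this way, with task accuracy in place of the dot-product error, gives the kind of accuracy-versus-precision curve the abstract refers to.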
Masayuki SHIMODA Youki SADA Ryosuke KURAMOCHI Shimpei SATO Hiroki NAKAHARA
In the realization of convolutional neural networks (CNNs) in resource-constrained embedded hardware, the memory footprint of weights is one of the primary problems. Pruning techniques are often used to reduce the number of weights. However, the distribution of nonzero weights is highly skewed, which makes it more difficult to utilize the underlying parallelism. To address this problem, we present SENTEI, filter-wise pruning with distillation, to realize a hardware-aware network architecture with comparable accuracy. The filter-wise pruning eliminates weights such that each filter has the same number of nonzero weights, and retraining with distillation retains the accuracy. Further, we develop a zero-weight skipping inter-layer pipelined accelerator on an FPGA. The equalization enables inter-filter parallelism, where a processing block for a layer executes filters concurrently with a straightforward architecture. Our evaluation on semantic-segmentation tasks indicates that the resulting mIoU decreased by only 0.4 points. Additionally, the speedup and power efficiency of our FPGA implementation were 33.2× and 87.9× higher than those of the mobile GPU. Therefore, our technique realizes a hardware-aware network with comparable accuracy.
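The equalization step of filter-wise pruning can be sketched as follows: every filter keeps the same number of weights and the rest are zeroed. The abstract does not state the selection criterion, so keeping the largest-magnitude weights is an assumption here, and the retraining with distillation is omitted.

```python
def prune_filter(filter_weights, keep):
    """Zero all but the `keep` largest-magnitude weights of one flattened filter."""
    ranked = sorted(
        range(len(filter_weights)),
        key=lambda i: abs(filter_weights[i]),
        reverse=True,
    )
    kept = set(ranked[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(filter_weights)]

def filter_wise_prune(layer, keep):
    """Apply the same per-filter budget to every filter in a layer."""
    return [prune_filter(f, keep) for f in layer]

# Three hypothetical flattened filters with six weights each.
layer = [
    [0.9, -0.1, 0.05, 0.4, -0.7, 0.02],
    [0.01, 0.3, -0.2, 0.25, 0.6, -0.8],
    [-0.5, 0.45, 0.1, -0.05, 0.3, 0.2],
]
pruned = filter_wise_prune(layer, keep=3)
nonzeros = [sum(1 for w in f if w != 0.0) for f in pruned]  # [3, 3, 3]
```

Because every filter ends up with an identical nonzero count, processing elements assigned to different filters finish in the same number of multiply-accumulate steps, which is what makes the inter-filter parallelism straightforward to exploit.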
Noriyuki MATSUNAGA Yamato OHTANI Tatsuya HIRAHARA
Deep neural network (DNN)-based speech synthesis has become popular in recent years and is expected to soon be widely used in embedded devices and environments with limited computing resources. The key goal for these systems in poor computing environments is to reduce the computational cost of generating speech parameter sequences while maintaining voice quality. However, reducing computational costs is challenging for the two primary conventional DNN-based methods used for modeling speech parameter sequences. In feed-forward neural networks (FFNNs) with maximum likelihood parameter generation (MLPG), the MLPG reconstructs the temporal structure of the speech parameter sequences ignored by FFNNs but requires additional computational cost that depends on the sequence length. In recurrent neural networks, the recursive structure allows for the generation of speech parameter sequences while considering temporal structures without the MLPG, but increases the computational cost compared to FFNNs. We propose a new approach for DNNs to acquire parameters captured from the temporal structure by backpropagating the errors of multiple attributes of the temporal sequence via the loss function. This method enables FFNNs to generate speech parameter sequences by considering their temporal structure without the MLPG. We generated the fundamental frequency sequence and the mel-cepstrum sequence with our proposed method and conventional methods, and then synthesized and subjectively evaluated the speech from these sequences. The proposed method enables even FFNNs that work on a frame-by-frame basis to generate speech parameter sequences by considering the temporal structure and to generate sequences perceptually superior to those from the conventional methods.
Keisuke MAEDA Kazaha HORII Takahiro OGAWA Miki HASEYAMA
A multi-task convolutional neural network (CNN) that achieves high performance and interpretability via attribute estimation is presented in this letter. Our method can provide an interpretation of the classification results of CNNs by outputting, in the middle layer, attributes that explain elements of objects as the reason for the CNN's judgement. Furthermore, the proposed network uses the estimated attributes for the subsequent prediction of classes. Consequently, a novel multi-task CNN with improvements in both interpretability and classification performance is realized.
Xue NI Huali WANG Ying ZHU Fan MENG
Low probability of intercept (LPI) radar waveforms have complex and diverse modulation schemes, which cannot be easily identified by traditional methods. Research on the recognition of intrapulse-modulated LPI radar waveforms has received increasing attention. In this paper, we propose an automatic LPI radar waveform recognition algorithm that uses a multi-resolution fusion convolutional neural network. First, signals embedded within the noise are processed using the Choi-Williams distribution (CWD) to obtain time-frequency feature images. Then, the images are resized by interpolation and sent to the proposed network for training and identification. The network adopts a dual-channel CNN structure to obtain features at different resolutions and fuses the features by using concatenation and an Inception module. Extensive simulations are carried out on twelve types of LPI radar waveforms, including BPSK, Costas, Frank, LFM, P1~P4, and T1~T4, corrupted with additive white Gaussian noise at SNRs from 10 dB to -8 dB. The results show that the overall recognition rate of the proposed algorithm reaches 95.1% when the SNR is -6 dB. We also try various sample selection methods related to the recognition task of the system. The conclusion is that reducing the number of samples with SNR above 2 dB or below -8 dB can effectively improve the training speed of the network while maintaining recognition accuracy.
Junya KOGUCHI Shinnosuke TAKAMICHI Masanori MORISE Hiroshi SARUWATARI Shigeki SAGAYAMA
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of the GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
Naohisa NISHIDA Tatsumi OBA Yuji UNAGAMI Jason PAUL CRUZ Naoto YANAI Tadanori TERUYA Nuttapong ATTRAPADUNG Takahiro MATSUDA Goichiro HANAOKA
Machine learning models inherently memorize significant amounts of information, and thus hiding not only prediction processes but also trained models, i.e., model obliviousness, is desirable in the cloud setting. Several works achieved model obliviousness with the MNIST dataset, but datasets that include complicated samples, e.g., CIFAR-10 and CIFAR-100, are also used in actual applications, such as face recognition. Secret sharing-based secure prediction for CIFAR-10 is difficult to achieve. When a deep layer architecture such as a CNN is used, the calculation error in secret computation becomes large and the accuracy deteriorates. In addition, if detailed calculations are performed to improve accuracy, a large amount of calculation is required. Therefore, if the conventional method is applied to a CNN as is, the good results described in the paper cannot be obtained. In this paper, we propose two approaches to solve this problem. First, we propose a new protocol named Batch-normalizedActivation that combines BatchNormalization and Activation. Since BatchNormalization includes real number operations, parameters must be converted into integers when performing secret computation, which causes calculation errors and decreases accuracy. By using our protocol, both calculation errors and accuracy degradation can be eliminated. Further, the processing is simplified, and the amount of calculation is reduced. Second, we explore an architecture that is friendly to secret computation and achieves high accuracy. Related works use a low-accuracy, simple architecture, but in reality, a high-accuracy architecture should be used. Therefore, we also explored a high-accuracy architecture for the CIFAR-10 dataset. Our proposed protocol can compute a prediction on CIFAR-10 within 15.05 seconds with 87.36% accuracy while providing model obliviousness.
Koji KUDO Keita MORIMOTO Akito IGUCHI Yasuhide TSUJI
We propose a new design approach to improve the computational efficiency of the optimal design of optical waveguide devices utilizing coupled mode theory (CMT) and a neural network (NN). Recently, NNs have begun to be used for efficient optimal design of optical devices. In this paper, the eigenmode analysis required in the CMT is skipped by using the NN, so that optimization with an evolutionary algorithm can be carried out efficiently. To verify the usefulness of our approach, optimal design examples of a wavelength-insensitive 3 dB coupler, a 1:2 power splitter, and a wavelength demultiplexer are shown, and their transmission properties obtained by the CMT with the NN (NN-CMT) are verified by comparison with those calculated by a finite element beam propagation method (FE-BPM).
Satomu YASUDA Yukihisa SUZUKI Keiji WADA
An active gate driver IC that generates arbitrary switching waveforms has been proposed to reduce the switching loss, the voltage overshoot, and the electromagnetic interference (EMI) by optimizing the switching pattern. However, it is hard to find the optimal switching pattern because the number of possible switching patterns is huge. In this paper, a method that uses a neural network (NN) to estimate the switching loss and the voltage overshoot from the switching pattern is proposed. The implemented NN model obtains reasonable learning results on the datasets.
In recent years, deep neural networks (DNNs) have achieved considerable results on many artificial intelligence tasks, e.g., natural language processing. However, the computational complexity of DNNs is extremely high. Furthermore, the performance improvement of the traditional von Neumann computing architecture has been slowing down due to the memory wall problem. Processing in memory (PIM), which places computation within memory and reduces data movement, breaks the memory wall. ReRAM PIM is thought to be a viable architecture for DNN accelerators. In this work, a novel design of a ReRAM neuromorphic system is proposed to process DNNs efficiently and fully in the array. The binary ReRAM array is composed of 2T2R storage cells and current mirror sense amplifiers. A dummy BL reference scheme is proposed for reference voltage generation. A binary DNN (BDNN) model is then constructed and optimized on the MNIST dataset. The model reaches a validation accuracy of 96.33% and is deployed to the ReRAM PIM system. A co-design model optimization method between the hardware device and the software algorithm is proposed, based on the idea of utilizing hardware variance information as uncertainty in the optimization procedure. Analysis shows that this method achieves a feasible hardware design and a generalizable model. With such a co-designed model deployed, the ReRAM array processes the DNN with high robustness against fabrication fluctuations.
Degen HUANG Anil AHMED Syed Yasser ARAFAT Khawaja Iftekhar RASHID Qasim ABBAS Fuji REN
Neural networks have received considerable attention in sentence similarity measuring systems due to their efficiency in dealing with semantic composition. However, existing neural network methods are not sufficiently effective in capturing the most significant semantic information buried in an input. To address this problem, a novel weighted-pooling attention layer is proposed to retain the most remarkable attention vector. It has already been established that long short-term memory and a convolutional neural network have a strong ability to accumulate enriched patterns of whole-sentence semantic representation. First, a sentence representation is generated by employing a siamese structure based on bidirectional long short-term memory and a convolutional neural network. Subsequently, a weighted-pooling attention layer is applied to obtain an attention vector. Finally, the attention vector pair information is leveraged to calculate the score of sentence similarity. The combination of bidirectional long short-term memory and a convolutional neural network results in a model with enhanced information extraction and learning capacity. Investigations show that the proposed method outperforms the state-of-the-art approaches on datasets for two tasks, namely semantic relatedness and Microsoft research paraphrase identification. The new model improves the learning capability and also boosts the similarity accuracy.
Ryuta SHINGAI Yuria HIRAGA Hisakazu FUKUOKA Takamasa MITANI Takashi NAKADA Yasuhiko NAKASHIMA
Modern deep learning has significantly improved performance and has been used in a wide variety of applications. Since the amount of computation required for the inference process of a neural network is large, inference is performed not at the data acquisition location, such as a surveillance camera, but on a server with abundant computing power installed in a data center. Edge computing is attracting considerable attention as a solution to this problem. However, edge computing can provide only limited computation resources. Therefore, we assumed a divided/distributed neural network model using both the edge device and the server. By processing part of the convolution layers on the edge, the amount of communication becomes smaller than that of the sensor data. In this paper, we have evaluated AlexNet and eight other models in the distributed environment and estimated FPS values with Wi-Fi, 3G, and 5G communication. To reduce communication costs, we also introduced a compression process before communication. This compression may degrade the object recognition accuracy. As necessary conditions, we set FPS to 30 or faster and object recognition accuracy to 69.7% or higher. This accuracy value is determined based on that of an approximation model that binarizes the activations of the neural network. We constructed performance and energy models to find the optimal configuration that consumes minimum energy while satisfying the necessary conditions. Through the comprehensive evaluation, we found the optimal configurations of all nine models. For small models, such as AlexNet, processing entire models on the edge was the best. On the other hand, for huge models, such as VGG16, processing entire models on the server was the best. For medium-size models, the distributed models were good candidates. We confirmed that our model found the most energy-efficient configuration while satisfying the FPS and accuracy requirements, and that the distributed models successfully reduced the energy consumption by up to 48.6%, and by 6.6% on average. We also found that HEVC compression is important before transferring the input data or the feature data between the distributed inference processes.
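The configuration search described above amounts to a constrained minimization, which can be sketched as follows. Only the two thresholds (30 FPS and 69.7% accuracy) come from the abstract; the `Config` fields, the configuration names, and every number in the candidate list are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    fps: float            # estimated frames per second
    accuracy: float       # object recognition accuracy in percent
    energy_joules: float  # estimated energy per frame

def select_config(configs, min_fps=30.0, min_accuracy=69.7):
    """Return the lowest-energy configuration meeting both conditions, or None."""
    feasible = [
        c for c in configs
        if c.fps >= min_fps and c.accuracy >= min_accuracy
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.energy_joules)

# Hypothetical candidates for one model; the values are illustrative only.
candidates = [
    Config("all-on-edge", fps=22.0, accuracy=71.0, energy_joules=0.80),
    Config("split-after-conv2", fps=34.0, accuracy=70.2, energy_joules=0.55),
    Config("split-after-conv2-high-compression", fps=41.0, accuracy=68.9, energy_joules=0.40),
    Config("all-on-server", fps=31.0, accuracy=71.0, energy_joules=0.95),
]
best = select_config(candidates)
```

In this toy list the most aggressive compression uses the least energy but falls below the accuracy threshold, so the search settles on the less compressed split, mirroring the trade-off between compression and recognition accuracy noted in the abstract.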